161 research outputs found

    MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities

    Entity Resolution (ER) aims to identify different descriptions in various Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the Variety, Volume and Veracity of entity descriptions published in the Web of Data. To address them, we propose the MinoanER framework that simultaneously fulfills full automation, support of highly heterogeneous entities, and massive parallelization of the ER process. MinoanER leverages a token-based similarity of entities to define a new metric that derives the similarity of neighboring entities from the most important relations, as they are indicated only by statistics. A composite blocking method is employed to capture different sources of matching evidence from the content, neighbors, or names of entities. The search space of candidate pairs for comparison is compactly abstracted by a novel disjunctive blocking graph and processed by a non-iterative, massively parallel matching algorithm that consists of four generic, schema-agnostic matching rules that are quite robust with respect to their internal configuration. We demonstrate that the effectiveness of MinoanER is comparable to existing ER tools over real KBs exhibiting low Variety, but it outperforms them significantly when matching KBs with high Variety. Comment: Presented at EDBT 2019
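
The token-based, schema-agnostic idea in the abstract above can be illustrated with a minimal sketch of generic token blocking: every token found in any attribute value defines a block, and two entities from different KBs become a candidate pair when they share at least one token. This is not MinoanER's actual metric or blocking graph; the function names (`tokens`, `token_blocks`, `candidate_pairs`) and the toy KBs are assumptions made for illustration only.

```python
import re
from collections import defaultdict


def tokens(description):
    """Return lowercased word tokens from all attribute values.

    Attribute names are ignored, which is what makes the approach schema-agnostic.
    """
    out = set()
    for value in description.values():
        out.update(re.findall(r"\w+", str(value).lower()))
    return out


def token_blocks(kb):
    """Map each token to the set of entity ids whose values contain it."""
    blocks = defaultdict(set)
    for entity_id, description in kb.items():
        for token in tokens(description):
            blocks[token].add(entity_id)
    return blocks


def candidate_pairs(kb1, kb2):
    """Return {(id1, id2): number of shared tokens} for entities sharing a block."""
    blocks1, blocks2 = token_blocks(kb1), token_blocks(kb2)
    shared = defaultdict(int)
    for token in blocks1.keys() & blocks2.keys():
        for e1 in blocks1[token]:
            for e2 in blocks2[token]:
                shared[(e1, e2)] += 1
    return dict(shared)


# Hypothetical toy KBs with different schemas (attribute names do not align).
kb1 = {"a1": {"name": "Knossos Palace", "place": "Crete"}}
kb2 = {
    "b1": {"label": "Palace of Knossos", "country": "Greece"},
    "b2": {"title": "Athens", "country": "Greece"},
}
pairs = candidate_pairs(kb1, kb2)
```

Here only `a1` and `b1` end up as a candidate pair, because they share the tokens "knossos" and "palace" even though no attribute names match. A real system would additionally weight tokens by how rare they are and prune oversized blocks.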

    Massively-Parallel Feature Selection for Big Data

    We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP manages to rely only on computations local to a partition while minimizing communication costs. Then, it employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, linear scalability with respect to the number of features and processing cores, while dominating other competitive algorithms in its class.
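
To make the "p-values plus meta-analysis" idea concrete, the sketch below combines p-values computed independently on separate row partitions using Fisher's method, one standard meta-analysis technique. The abstract does not say which combination rule PFBP uses, so treat this as an assumption; the function name `fisher_combine` is hypothetical, and the method presumes the per-partition tests are independent.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form exp(-s) * sum_{i<k} s**i / i!,
    where s = X / 2, so no external statistics library is needed.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    s = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0  # running s**i / i! and its partial sum
    for i in range(1, len(p_values)):
        term *= s / i
        total += term
    return min(1.0, math.exp(-s) * total)


# Each partition only reports one local p-value, so communication stays small.
local_p_values = [0.05, 0.05]
combined = fisher_combine(local_p_values)
```

Two partitions that each give weak evidence (p = 0.05) yield a combined p-value of roughly 0.017, which shows how local results can be pooled without moving the underlying rows between workers.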

    Report on the First International Workshop on Personal Data Analytics in the Internet of Things (PDA@IOT 2014)

    The 1st International Workshop on Personal Data Analytics in the Internet of Things (PDA@IOT), held in conjunction with VLDB 2014, aims at sparking research on data analytics, shifting the focus from business to consumer services. While much of the public and academic discourse about personal data has been dominated by privacy concerns and the risks they raise to the individual, especially when personal data are seen as the new oil of the global economy, PDA@IOT focuses on how people could effectively exploit the data they massively create in cyber-physical worlds. We believe that the full potential of the IoT goes far beyond connecting “things” to the Internet: it is about using data to create new value for people. In a People-centric computing paradigm, both small-scale personal data and large-scale aggregated data should be exploited to identify unmet needs and proactively offer them to users. PDA@IOT seeks to address current technology barriers that prevent existing personal data processing and analytics solutions from empowering people in personal decision making. The PDA@IOT ambition is to provide a unique forum for researchers and practitioners who approach personal data from different angles, ranging from data management and processing to data mining and human-data interaction, as well as to nourish the interdisciplinary synergies required to tackle the challenges and problems emerging in People-centric Computing.

    End-to-End Entity Resolution for Big Data: A Survey

    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey, we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions.

    Scheduling of Continuous Operators for IoT edge Analytics with Time Constraints

    Data stream processing and analytics (DSPA) engines are used to extract valuable information from multiple IoT data streams in (near) real time. Deploying DSPA applications at the IoT network edge through Edge/Fog architectures is currently one of the core challenges for reducing both network delays and network bandwidth usage to reach the Cloud. In this paper, we address the problem of scheduling continuous DSPA operators to Fog-Cloud nodes featuring both computational and network resources. We pay particular attention to the dynamic workload of these nodes due to the variability of IoT data stream rates and the sharing of nodes' resources by multiple DSPA applications. In this respect, we propose TSOO, a resource-aware and time-efficient heuristic algorithm that takes into account the limited Fog computational resources, the real-time response constraints of DSPA applications, as well as congestion and delay issues on Fog-to-Cloud network resources. Via extensive simulation experiments, we show that TSOO approximates an optimal operator placement with a low execution cost.

    Querying Temporal Drifts at Multiple Granularities

    There exists a large body of work on online drift detection with the goal of dynamically finding and maintaining changes in data streams. In this paper, we adopt a query-based approach to drift detection. Our approach relies on a drift index, a structure that captures drift at different time granularities and enables flexible drift queries. We formalize different drift queries that represent real-world scenarios and develop query evaluation algorithms that use different materializations of the drift index as well as strategies for online index maintenance. We describe a thorough study of the performance of our algorithms on real-world and synthetic datasets with varying change rates.
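
As a rough illustration of querying drift at several time granularities, the toy index below keeps per-window sums and counts for each granularity and reports a drift whenever the mean of a window differs from the mean of the preceding non-empty window by more than a threshold. The abstract does not define the paper's drift index or its drift measure, so the class name `DriftIndex`, the mean-difference criterion, and the integer timestamps are all assumptions.

```python
from collections import defaultdict


class DriftIndex:
    """Toy multi-granularity index (hypothetical, not the paper's structure)."""

    def __init__(self, granularities):
        # One set of window aggregates per granularity (window length in time units).
        self.granularities = sorted(granularities)
        self.sums = {g: defaultdict(float) for g in self.granularities}
        self.counts = {g: defaultdict(int) for g in self.granularities}

    def add(self, timestamp, value):
        """Online maintenance: update the window containing `timestamp` at every granularity."""
        for g in self.granularities:
            window = timestamp // g
            self.sums[g][window] += value
            self.counts[g][window] += 1

    def drifts(self, granularity, threshold):
        """Return (window start time, change in mean) for each drifting window."""
        sums, counts = self.sums[granularity], self.counts[granularity]
        windows = sorted(counts)
        found = []
        for prev, cur in zip(windows, windows[1:]):
            delta = sums[cur] / counts[cur] - sums[prev] / counts[prev]
            if abs(delta) > threshold:
                found.append((cur * granularity, delta))
        return found


index = DriftIndex(granularities=[2, 4])
for t, v in enumerate([1, 1, 3, 3, 1, 1, 3, 3]):
    index.add(t, v)

fine = index.drifts(granularity=2, threshold=1.0)
coarse = index.drifts(granularity=4, threshold=1.0)
```

In this example the oscillation is visible at granularity 2 but averages out at granularity 4, which is the kind of difference a multi-granularity drift query is meant to expose.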

    Scheduling Continuous Operators for IoT Edge Analytics

    In this paper, we explore the Edge-Fog-Cloud architecture as an alternative approach to Cloud-based IoT data analytics. Given the limited computational resources of the Fog, which can also be shared among multiple analytics with continuous operators over data streams, we introduce a holistic cost model that accounts for both the network and computational resources available in the Edge-Fog-Cloud architecture. Then, we propose the scheduling algorithms RCS and SOO-CPLEX for placing continuous operators for data stream analytics at the network edge. The former dynamically places continuous operators between the Cloud and the Fog according to the evolution of data stream rates and uses as few Fog computational resources as possible to satisfy the constraints regarding the usage of both computational and network resources. The latter statically places continuous operators between the Cloud and the Fog to minimize the overall computational and network resource usage cost. We evaluate the effectiveness of SOO-CPLEX and RCS through thorough simulation experiments.

    Efficient Scheduling of Streaming Operators for IoT Edge Analytics

    Data stream processing and analytics (DSPA) applications are widely used to process the ever-increasing amounts of data streams produced by highly geographically distributed data sources, such as fixed and mobile IoT devices, in order to extract valuable information in a timely manner for real-time actuation. To efficiently handle this ever-increasing amount of data streams, the emerging Edge/Fog computing paradigm is used as the middle tier between the Cloud and the IoT devices to process data streams closer to their sources and to reduce the network resource usage and network delay to reach the Cloud. In this paper, we account for the fact that both network resources and computational resources can be limited and shareable among multiple DSPA applications in the Edge-Fog-Cloud architecture, hence it is necessary to ensure their efficient usage. In this respect, we propose a resource-aware and time-efficient heuristic called SOO that identifies a good DSPA operator placement on the Edge-Fog-Cloud architecture towards optimizing the trade-off between the computational and network resource usage. Via thorough simulation experiments, we show that the solution provided by SOO is very close to the optimal one while the execution time is considerably reduced.